4th WEEK: Clustering

The aim of this week’s exercise is to fit a model that classifies the suburbs of the Boston data set into classes based on their characteristics. Clustering identifies groups (clusters) of similar observations without prior knowledge of the groups.

Introduction

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here. (0-1 points)

The Boston data describe housing values in suburbs of Boston and can be loaded from the R package MASS. According to the help page (?Boston), the data frame has 506 rows (observations) and 14 columns (variables). Briefly, the data report median values of homes in the Boston area together with several variables potentially explaining the median value across tracts.

There are several interesting variables, e.g. safety (crime rate), air quality (nitrogen oxides), average wealth (tax), education (pupil-teacher ratio) and the value of owner-occupied homes.

  • crim: per capita crime rate by town.

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town.

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox: nitrogen oxides concentration (parts per 10 million).

  • rm: average number of rooms per dwelling.

  • age: proportion of owner-occupied units built prior to 1940.

  • dis: weighted mean of distances to five Boston employment centres.

  • rad: index of accessibility to radial highways.

  • tax: full-value property-tax rate per $10,000.

  • ptratio: pupil-teacher ratio by town.

  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  • lstat: lower status of the population (percent).

  • medv: median value of owner-occupied homes in $1000s.

First the data are loaded and glimpsed, and then summaries are printed.

# load data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
kable(summary(Boston[,1:7]))
|         |     crim |     zn | indus |    chas |    nox |    rm |    age |
|---------|----------|--------|-------|---------|--------|-------|--------|
| Min.    |  0.00632 |   0.00 |  0.46 | 0.00000 | 0.3850 | 3.561 |   2.90 |
| 1st Qu. |  0.08204 |   0.00 |  5.19 | 0.00000 | 0.4490 | 5.886 |  45.02 |
| Median  |  0.25651 |   0.00 |  9.69 | 0.00000 | 0.5380 | 6.208 |  77.50 |
| Mean    |  3.61352 |  11.36 | 11.14 | 0.06917 | 0.5547 | 6.285 |  68.57 |
| 3rd Qu. |  3.67708 |  12.50 | 18.10 | 0.00000 | 0.6240 | 6.623 |  94.08 |
| Max.    | 88.97620 | 100.00 | 27.74 | 1.00000 | 0.8710 | 8.780 | 100.00 |
kable(summary(Boston[,8:14])) 
|         |    dis |    rad |   tax | ptratio |  black | lstat |  medv |
|---------|--------|--------|-------|---------|--------|-------|-------|
| Min.    |  1.130 |  1.000 | 187.0 |   12.60 |   0.32 |  1.73 |  5.00 |
| 1st Qu. |  2.100 |  4.000 | 279.0 |   17.40 | 375.38 |  6.95 | 17.02 |
| Median  |  3.207 |  5.000 | 330.0 |   19.05 | 391.44 | 11.36 | 21.20 |
| Mean    |  3.795 |  9.549 | 408.2 |   18.46 | 356.67 | 12.65 | 22.53 |
| 3rd Qu. |  5.188 | 24.000 | 666.0 |   20.20 | 396.23 | 16.95 | 25.00 |
| Max.    | 12.127 | 24.000 | 711.0 |   22.00 | 396.90 | 37.97 | 50.00 |

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)

#Matrix of plots
# Density plot
par(mfrow=c(3, 3))
colnames <- dimnames(Boston)[[2]]
for (i in 1:14) {
    d <- density(Boston[,i])
    plot(d, type="n", main=colnames[i])
    polygon(d, col="red", border="gray")
}

ggpairs(Boston,       
  #upper = list(continuous = wrap("cor", size = 3)), 
  lower = list(continuous = wrap("points", alpha = .2, size = .6),
               combo = wrap("facethist", bins = 10))) +
  theme(
    axis.text.x = element_text(angle = 90, color = "black", size = 7, vjust = .5),
    axis.text.y = element_text(color = "black", size = 7))
ggpairs(Boston[,8:14],       
  #upper = list(continuous = wrap("cor", size = 3)), 
  lower = list(continuous = wrap("points", alpha = .2, size = .6),
               combo = wrap("facethist", bins = 10))) +
  theme(
    axis.text.x = element_text(angle = 90, color = "black", size = 7, vjust = .5),
    axis.text.y = element_text(color = "black", size = 7))


gather(Boston) %>% ggplot(aes(value)) + 
  facet_wrap("key", scales = "free") + 
  geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

We are most interested in the ‘crim’ variable, which describes the per capita crime rate of the area. The crime rate is a continuous variable that varies greatly between areas: the maximum crime rate is very high compared to the median. Looking at the quantiles, we can see that this is partly due to outliers, as the 3rd quartile is much smaller than the maximum value, although even the 3rd quartile is already high compared to the minimum.
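The skew discussed above can also be seen numerically from a finer quantile grid; a small sketch (assuming the Boston data are loaded as above):

```r
# Sketch: quantiles of the per capita crime rate; the jump between the
# upper quantiles and the maximum illustrates the heavy right tail.
library(MASS)
data("Boston")
quantile(Boston$crim, probs = c(0, 0.25, 0.5, 0.75, 0.9, 0.99, 1))
```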

Most of the variables are percentages, proportions or values computed with their own formulas. Given these characteristics, it is understandable that some of them have rather uneven or skewed distributions: e.g. black (a scaled transformation of the proportion of black residents), indus (proportion of non-retail business acres), age (proportion of owner-occupied units built prior to 1940), zn (proportion of land zoned for very large lots) and lstat (percentage of lower-status population). In addition, the crime rate varies a lot between areas, likely at least partly due to some outlier values. The Charles River dummy variable (chas) is binary (1/0), indicating whether the tract bounds the river, and the radial-highway accessibility index (rad) is an interval-scaled index. The average number of rooms per dwelling (rm) is approximately normally distributed, and the distribution of the median value of owner-occupied homes (medv) can also be judged roughly normal.

To explore the relations between the variables of the dataset, let’s print out pairwise scatter plots and a correlation plot.

pairs(Boston)

ggpairs(Boston,lower=list(combo=wrap("facethist",bins=20)))+ggtitle("All variables")

cor_matrix<-cor(Boston) %>% round(2)
corrplot.mixed(cor_matrix,number.cex=0.65,tl.cex=0.6)

cb <- as.data.frame(cor(Boston))   # Create a data frame of the correlation matrix.
cor_matrix_high <- as.data.frame(matrix(nrow = 14, ncol = 14)) # Copy the structure
colnames(cor_matrix_high) <- colnames(cor_matrix)              # of cor_matrix.
rownames(cor_matrix_high) <- rownames(cor_matrix)
cor_threshold <- 0.7
# Loop through the correlation matrix and save only values that exceed the threshold.
for(col in names(cb)) {
  for(row in 1:length(cb[[col]])) {
    if(abs(cb[[col,row]]) > cor_threshold & abs(cb[[col,row]]) < 1) { 
      cor_matrix_high[col,as.character(rownames(cb)[row])] <- round(cb[[col,row]], digits = 2)
    }
  }
}
# Print the matrix.
cor_matrix_high
##         crim zn indus chas   nox rm   age   dis  rad  tax ptratio black
## crim      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
## zn        NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
## indus     NA NA    NA   NA  0.76 NA    NA -0.71   NA 0.72      NA    NA
## chas      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
## nox       NA NA  0.76   NA    NA NA  0.73 -0.77   NA   NA      NA    NA
## rm        NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
## age       NA NA    NA   NA  0.73 NA    NA -0.75   NA   NA      NA    NA
## dis       NA NA -0.71   NA -0.77 NA -0.75    NA   NA   NA      NA    NA
## rad       NA NA    NA   NA    NA NA    NA    NA   NA 0.91      NA    NA
## tax       NA NA  0.72   NA    NA NA    NA    NA 0.91   NA      NA    NA
## ptratio   NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
## black     NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
## lstat     NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
## medv      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA    NA
##         lstat  medv
## crim       NA    NA
## zn         NA    NA
## indus      NA    NA
## chas       NA    NA
## nox        NA    NA
## rm         NA    NA
## age        NA    NA
## dis        NA    NA
## rad        NA    NA
## tax        NA    NA
## ptratio    NA    NA
## black      NA    NA
## lstat      NA -0.74
## medv    -0.74    NA
library(xtable)
cor_mat = cor(Boston)%>% round(digits=2)
cor_mat
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# x : a matrix containing the data
# method : correlation method, "pearson" or "spearman"
# removeTriangle : remove the "upper" or "lower" triangle
# result : if "html" or "latex",
#   the results will be displayed in html or latex format
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
                     result=c("none", "html", "latex")){
    #Compute correlation matrix
    require(Hmisc)
    x <- as.matrix(x)
    correlation_matrix<-rcorr(x, type=method[1])
    R <- correlation_matrix$r # Matrix of correlation coefficients
    p <- correlation_matrix$P # Matrix of p-values
    
    ## Define notation for the significance levels; spacing is important.
    mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "**  ", ifelse(p < .05, "*   ", "    "))))
    
    ## truncate the correlation matrix to two decimals
    R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
    
    ## build a new matrix that includes the correlations with their appropriate stars
    Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
    diag(Rnew) <- paste(diag(R), " ", sep="")
    rownames(Rnew) <- colnames(x)
    colnames(Rnew) <- paste(colnames(x), "", sep="")
    
    ## remove upper triangle of correlation matrix
    if(removeTriangle[1]=="upper"){
      Rnew <- as.matrix(Rnew)
      Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## remove lower triangle of correlation matrix
    else if(removeTriangle[1]=="lower"){
      Rnew <- as.matrix(Rnew)
      Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## remove last column and return the correlation matrix
    Rnew <- cbind(Rnew[1:(length(Rnew) - 1)])
    if (result[1]=="none") return(Rnew)
    else{
      if(result[1]=="html") print(xtable(Rnew), type="html")
      else print(xtable(Rnew), type="latex") 
    }
} 
corstars(Boston[,1:14],result ="html")
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:xtable':
## 
##     label, label<-
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
Stars denote significance: * p<.05, ** p<.01, *** p<.001, **** p<.0001.

|         | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
|---------|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|
| crim    | | | | | | | | | | | | | |
| zn      | -0.20**** | | | | | | | | | | | | |
| indus   | 0.41**** | -0.53**** | | | | | | | | | | | |
| chas    | -0.06 | -0.04 | 0.06 | | | | | | | | | | |
| nox     | 0.42**** | -0.52**** | 0.76**** | 0.09* | | | | | | | | | |
| rm      | -0.22**** | 0.31**** | -0.39**** | 0.09* | -0.30**** | | | | | | | | |
| age     | 0.35**** | -0.57**** | 0.64**** | 0.09 | 0.73**** | -0.24**** | | | | | | | |
| dis     | -0.38**** | 0.66**** | -0.71**** | -0.10* | -0.77**** | 0.21**** | -0.75**** | | | | | | |
| rad     | 0.63**** | -0.31**** | 0.60**** | -0.01 | 0.61**** | -0.21**** | 0.46**** | -0.49**** | | | | | |
| tax     | 0.58**** | -0.31**** | 0.72**** | -0.04 | 0.67**** | -0.29**** | 0.51**** | -0.53**** | 0.91**** | | | | |
| ptratio | 0.29**** | -0.39**** | 0.38**** | -0.12** | 0.19**** | -0.36**** | 0.26**** | -0.23**** | 0.46**** | 0.46**** | | | |
| black   | -0.39**** | 0.18**** | -0.36**** | 0.05 | -0.38**** | 0.13** | -0.27**** | 0.29**** | -0.44**** | -0.44**** | -0.18**** | | |
| lstat   | 0.46**** | -0.41**** | 0.60**** | -0.05 | 0.59**** | -0.61**** | 0.60**** | -0.50**** | 0.49**** | 0.54**** | 0.37**** | -0.37**** | |
| medv    | -0.39**** | 0.36**** | -0.48**** | 0.18**** | -0.43**** | 0.70**** | -0.38**** | 0.25**** | -0.38**** | -0.47**** | -0.51**** | 0.33**** | -0.74**** |

The correlation chart presents both the direction and the magnitude of the correlations: the size of each circle scales with the absolute correlation coefficient, and its colour indicates whether the correlation is negative or positive.

The variables are correlated with each other in complex ways. There are strong negative correlations between the weighted mean distance to five Boston employment centres (dis) and the proportion of non-retail business acres per town (indus), nitrogen oxide concentration (nox), and the proportion of older properties (age). There is also a strong negative correlation between the percentage of lower-status population (lstat) and the median value of owner-occupied homes in $1000s (medv). Strong positive correlations appear especially between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax). Indus is further positively correlated with nitrogen oxide concentration (nox) and the full-value property-tax rate (tax).

Here it is also visible that rad (index of accessibility to radial highways) correlates negatively with dis (weighted mean of distances to five Boston employment centres), and that lstat (percentage of lower-status population) correlates negatively with medv (median value of owner-occupied homes in $1000s).

There are clearly relevant correlations between the variables: the crime rate is correlated with many of them. Since the crime rate (like several other variables) is highly variable, the correlations are not easy to read from the scatter plots. From the correlation plot, however, crime rate is negatively correlated with e.g. housing values (medv) and distances to employment centres (dis), and positively correlated with e.g. access to radial highways (rad) and property-tax rate (tax). High correlations among variables other than crim are also found.

Variables tax (property taxes) and rad have a remarkably high correlation with each other: 0.91. This might mean that the taxation is at least partially based on highway accessibility. Other relatively high correlations include e.g. the negative correlation between the amount of nitrogen oxides (nox) and dis (-0.77), the positive correlation between industry presence (indus) and nox (0.76), and the negative correlation between the proportion of pre-1940s buildings (age) and dis (-0.75).

Scaling the dataset and categorising crime rate

Linear discriminant analysis (LDA) is a method for finding linear combinations of the variables that characterise the classes of a categorical target variable. The data set needs to be standardised first, i.e. each variable is scaled so that its mean is zero and its standard deviation is one (LDA also assumes the predictors are roughly normally distributed).

Below is the summary of the Boston data frame after standardising it with the scale function. Compared to the previous summary, it is clear that the values have changed: all variables are now on a common scale, which can best be seen by looking at the minimum values. All of them are negative, whereas previously they were positive. Another remarkable change is that the means have all turned to zero. This follows from the standardisation: the standard score is computed as (x - mean)/sd, which centres every variable at zero. Here is a helpful post about understanding the scale function: Understanding scale in r

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)

The variables are all continuous so the data can be scaled with one command for the further analysis.

Scaling here means subtracting the column means from the corresponding columns and dividing the difference with standard deviation:

\[ scaled(x) = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)} \]

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
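The scaling formula can also be verified directly against R’s scale function; a minimal check sketch for a single column:

```r
# Sketch: scale() should equal (x - mean(x)) / sd(x) column by column.
library(MASS)
data("Boston")
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- as.data.frame(scale(Boston))$crim
all.equal(manual, scaled)  # TRUE
```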

From the scaled per capita crime rate by town we create a categorical crime variable, using the quantiles as the break points. As a result we get a variable that divides the towns of the Boston data into low, medium-low, medium-high and high crime rate classes.

# Create a quantile vector of crim, and use it to create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c('low','med_low','med_high','high'))
# Replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime) # Explore the categorised variable.
## 
##      low  med_low med_high     high 
##      127      126      126      127

Fitting the Model

**Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)**

We’ll use an LDA model to classify the suburbs into crime rate classes. First we divide the dataset into a training set and a test set: a random 80% of the observations form the training set and the remaining 20% the test set, giving 404 and 102 observations respectively. We fit the classifier on the training set and then use the test set to see how well the classification performs on new data. The test set does not include the crime categories; these are stored separately.

In my LDA (linear discriminant analysis) the categorical crime variable is the target and all the other variables are predictor variables. I fit the LDA model to the train set and then try to predict the classes in the test set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

As we used quantiles to categorise the original variable, we have four classes. Thus the output shows three linear discriminants, as expected. Of these, the first explains the vast majority - about 94% - of the between-group variance.
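The proportion of between-group variance per discriminant can be recomputed from the fitted model; a sketch (assuming the lda.fit object from above):

```r
# Sketch: proportion of trace per linear discriminant, as in print(lda.fit).
# The exact values depend on the random train/test split.
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 3)
```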

The first two of the model’s linear discriminants can be visualised as follows. A helper function is needed to draw the arrows in the biplot:

# Define a function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime) # Turn the classes to numeric for plotting.
plot(lda.fit, dimen = 2, col = classes, pch = classes) # Plot.
lda.arrows(lda.fit) # Add arrows.

It can be seen that the high crime rate class really stands out, while the rest remain close together.

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       8        1    0
##   med_low    3      22        8    0
##   med_high   0       3       13    1
##   high       0       0        0   25

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)

The categorical crime rate variable is removed from the test set. Now we predict the classes in the test data with the LDA model. From the cross tabulation of the correct and predicted observations we can see that the prediction is mainly usable. The high rates are all predicted correctly, and most of the medium-high observations as well. The medium-low predictions are the most problematic.

Most of the predictions fall on the diagonal of the cross tabulation. The prediction error is about 25%, which is probably an acceptable result.
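The prediction error quoted above can be computed directly from the cross tabulation; a sketch (assuming correct_classes and lda.pred from the chunks above):

```r
# Sketch: overall accuracy and misclassification rate of the LDA predictions.
tab <- table(correct = correct_classes, predicted = lda.pred$class)
accuracy <- sum(diag(tab)) / sum(tab)
c(accuracy = accuracy, error = 1 - accuracy)
```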

(I used `addmargins()` when tabulating, because in my opinion that’s more illustrative and helps comparisons.) As seen from the table, the model predicted the highest crime rates reliably, but the “med_low” category is overrepresented relative to the “low” and “med_high” categories. Thus, the model can be used to make crude predictions, but it’s hardly perfect. It might be better to use an unsupervised method and cluster the data instead of classifying it.

Note that the exact numbers depend on the random train/test split. With the split tabulated above, 78/102 observations were predicted correctly. The best predictions were made for the high values, which was expected: as seen from the biplot, the high class differs clearly from the rest. The worst predictions were made for the low and medium-low values, which are the most mixed in the biplot, making prediction difficult.

K means

Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

With K-means clustering we aim to find clusters of similar observations in the data, without prior knowledge of these clusters. For clustering we use the scaled Boston dataset, so that the distances between observations are comparable. Euclidean distance is the ordinary geometric distance; Manhattan distance is the distance measured along the axes.

Unlike LDA, K-means is an unsupervised clustering method that divides the observations into clusters. The number of clusters is chosen beforehand and passed to the kmeans function via the centers argument.

First, I reload the Boston data frame and scale it again. In order to execute K-means, the distances between observations need to be calculated. I use the Euclidean distance here, since it is the most common distance measure; the Manhattan distance (the distance as if you lived in Manhattan and had to move along the grid of streets) is an alternative.

Finally, we reload the Boston dataset and standardise it, calculate the Euclidean distances between the observations, and run the K-means algorithm, first with five clusters.

data("Boston")
#Center and standardize variables and make it a data frame
boston_scaled<-as.data.frame(scale(Boston))
#Euclidean distance matrix
dist_eu<-dist(boston_scaled)
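The Manhattan distance mentioned above can be computed the same way for comparison; a small sketch (assuming boston_scaled and dist_eu from above):

```r
# Sketch: Euclidean vs Manhattan distance summaries on the scaled data.
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_eu)
summary(dist_man)
```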

For the k-means on the distances I use 5 clusters.

km <- kmeans(dist_eu, centers = 5)
pairs(boston_scaled, col = km$cluster, lower.panel = NULL)

Let’s determine the optimal number of clusters using the within cluster sum of squares (WCSS).

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
#Plot the results
plot(1:k_max, twcss, type = 'b')

To find the optimal number of clusters, we explore the total within-cluster sum of squares (TWCSS) with the number of clusters ranging from 1 to 10. The rule of thumb is that the optimal number of clusters is reached where the TWCSS drops radically. Here the most prominent drop happens when K changes from 1 to 2, suggesting 2 clusters. The drop from 2 to 3 might also be considered radical (the exact shape depends on the locations of the random initial cluster centers, but this drop was about 25% of TWCSS when the initial centers were assigned using set.seed(123) as above). We stick to K=2 as the optimal, although this is debatable.

So how to determine the number of clusters? Let’s run a set of tests to see if we can find some consensus. Used in this way, NbClust performs a number of tests and reports the best number of clusters based on majority rule.
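A sketch of such a test battery, assuming the NbClust package is installed (the suggested numbers depend on the data and the seed):

```r
# Sketch: let NbClust vote on the number of clusters across many indices.
library(NbClust)
set.seed(123)
nb <- NbClust(boston_scaled, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "kmeans")
# Tally the votes: the most frequently suggested number of clusters wins.
table(nb$Best.nc["Number_clusters", ])
```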

With K=2 clusters, we can see from the pairwise scatter plots that some of the variables are clearly divided between the clusters (in the sense of these pairwise plots), some are not.

The crime rate, for example, is clearly divided so that one of the clusters includes only low crime rates while the other includes both low and high crime rates. Some of the other variables are clearly dichotomous when plotted against others - for example, rad and tax show such a tendency.

km <- kmeans(dist_eu, centers = 2)
pairs(boston_scaled, col = km$cluster, lower.panel = NULL)

Some variables seem to be more relevant for clustering than others. For example, points with low tax and rad values belong to the red cluster, while points with high values belong to the black cluster. On the other hand, some variables, such as chas, are not clearly divided, which means chas is not relevant for the clustering. As the plots above demonstrate, there seems to be less overlap between the clusters than with five clusters, which suggests that, at least when using Euclidean distance, the optimal number of clusters is indeed two. Because different distance measures may produce different results, I also briefly tested the code by creating the dist_eu variable with the Manhattan distance method instead, but found the results so similar to the Euclidean ones that they are not worth repeating here. The most notable difference was that with the Manhattan method, the TWCSS plot hinted even more strongly that the optimal number of clusters is two.


As the plots above demonstrate, there seems to be less overlap between the clusters than with four clusters, which suggests that at least with the Euclidean distance, the optimal number of clusters is indeed two. Because different distance measures can produce different results, I also briefly tested my code by creating the dist_eu variable with the Manhattan distance method instead, but found the results so similar to the Euclidean ones that they are not worth repeating here. The most notable difference was that with the Manhattan method, the TWCSS plot hinted even more strongly that the optimal number of clusters is two.
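The Manhattan comparison mentioned above amounts to a one-argument change in the distance call; a minimal sketch, again assuming boston_scaled holds the standardized data:

```r
# same pipeline with Manhattan distances instead of Euclidean
dist_man <- dist(boston_scaled, method = "manhattan")

set.seed(123)
twcss_man <- sapply(1:10, function(k) {
  kmeans(dist_man, centers = k)$tot.withinss
})
plot(1:10, twcss_man, type = "b",
     xlab = "Number of clusters K", ylab = "TWCSS")
```

The TWCSS values themselves are not comparable across distance measures, but the shape of the curve (where the elbow sits) is.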

From the latter figure it would be possible to investigate how the observations divide into the two clusters with respect to every variable in the analysis. Personally I do not find this kind of giant figure matrix very informative, because its readability is very low, but it can help when making decisions about further analysis: if the colours in a panel are thoroughly mixed, the variables are not very significant for the clusters, whereas if the colours are well separated, the variables in that panel drive the formation of the clusters.

The new visualization with only two clusters looks different from the one drawn before. Crime rate clearly (with a few exceptions) belongs to one cluster, shown in red, and the proportion of residential land zoned for large lots (zn) to the other cluster, coloured black. The same cluster as the high crime rates also contains most of the observations with a high share of lower-status population (lstat), high values of the black variable and a high nitrogen oxides concentration (nox). This means that parts of town with a high crime rate most likely also have a lower-status population and high amounts of nitrogen oxides in the air. Areas with a large proportion of residential lots are likely to have good accessibility to radial highways (rad), a high full-value property-tax rate (tax) and more pupils per teacher.

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(boston_scaled)
set.seed(123)
km <- kmeans(dist_eu, centers = 4)

pairs(Boston, col = km$cluster)

lda.fit2 <- lda(km$cluster ~., data = Boston)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda.fit2, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit2, myscale = 2)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped

The graph above shows the LDA model with the k-means clusters as the target variable and the Boston data frame as the data. The variable with the longest arrow is nitrogen oxides concentration (nox), and it divides the data frame's observations into two separate areas. For the other variables it is harder to say which could be an influential linear separator, so let's look at the arrows more closely.
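Besides stretching the arrows, the discriminant coefficients can be listed directly; a short sketch ranking the variables by their absolute weight on the first discriminant (assuming the lda.fit2 object fitted above):

```r
# coefficients of the linear discriminants, ordered by |LD1|
coefs <- coef(lda.fit2)
coefs[order(abs(coefs[, "LD1"]), decreasing = TRUE), ]
```

The variables at the top of this listing are the ones whose arrows dominate the biplot along the first axis.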

plot(lda.fit2, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit2, myscale = 10)

**Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers')

Adjust the code: add argument color as an argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)**


model_predictors <- dplyr::select(train, -crime)

dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

As the super-bonus task I ran the given code, which projects the observations onto the three linear discriminants of the earlier LDA (not the one from the bonus task) and passes the projection to the plot_ly function. The result is this beautiful 3D graph. If you move your mouse over the observations, you can see their exact coordinates.
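The second plot asked for in the assignment, coloured by the k-means clusters instead of the crime classes, needs only the color argument swapped; a sketch assuming a 2-cluster k-means fitted on the same scaled train predictors (km_train is my own name for that fit):

```r
# k-means on the scaled train predictors, then colour the
# LDA projection by cluster membership
set.seed(123)
km_train <- kmeans(model_predictors, centers = 2)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
        z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers',
        color = factor(km_train$cluster))
```

Comparing the two colourings shows how well the unsupervised clusters line up with the crime classes used as the LDA target.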